{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 7.1 単語分割（Janome、MeCab+NEologd）\n",
    "\n",
    "- 本ファイルでは、JanomeもしくはMeCab+NEologdを使用して分かち書きします\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "※　本章のファイルはすべてUbuntuでの動作を前提としています。Windowsなど文字コードが違う環境での動作にはご注意下さい。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 7.1 学習目標\n",
    "\n",
    "1.\t機械学習における自然言語処理の流れを理解する\n",
    "2.\tJanomeおよびMeCab+NEologdを用いた形態素解析を実装できるようになる\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 事前準備\n",
    "\n",
    "- 書籍の指示に従い、本章で使用するデータを用意します\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1 . 単語へ分割：Tokenizer\n",
    "\n",
    "分かち書きをする部分を作成します\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.1 Janome\n",
    "\n",
    "公式サイト\n",
    "\n",
    "https://mocobeta.github.io/janome/\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Janomeのインストール方法\n",
    "\n",
    "コンソールにて、\n",
    "\n",
    "- source activate pytorch_p36\n",
    "- pip install janome"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "機械\t名詞,一般,*,*,*,*,機械,キカイ,キカイ\n",
      "学習\t名詞,サ変接続,*,*,*,*,学習,ガクシュウ,ガクシュー\n",
      "が\t助詞,格助詞,一般,*,*,*,が,ガ,ガ\n",
      "好き\t名詞,形容動詞語幹,*,*,*,*,好き,スキ,スキ\n",
      "です\t助動詞,*,*,*,特殊・デス,基本形,です,デス,デス\n",
      "。\t記号,句点,*,*,*,*,。,。,。\n"
     ]
    }
   ],
   "source": [
    "from janome.tokenizer import Tokenizer\n",
    "\n",
    "j_t = Tokenizer()\n",
    "\n",
    "text = '機械学習が好きです。'\n",
    "\n",
    "for token in j_t.tokenize(text):\n",
    "    print(token)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['機械', '学習', 'が', '好き', 'です', '。']\n"
     ]
    }
   ],
   "source": [
    "# 単語分割する関数を定義\n",
    "\n",
    "\n",
    "def tokenizer_janome(text):\n",
    "    return [tok for tok in j_t.tokenize(text, wakati=True)]\n",
    "\n",
    "\n",
    "text = '機械学習が好きです。'\n",
    "print(tokenizer_janome(text))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.2 MeCab\n",
    "\n",
    "公式サイト\n",
    "\n",
    "http://taku910.github.io/mecab/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### MeCab+NEologdのインストール方法\n",
    "\n",
    "1. MeCabのインストール\n",
    "\n",
    "sudo apt install mecab\n",
    "\n",
    "sudo apt install libmecab-dev\n",
    "\n",
    "sudo apt install mecab-ipadic-utf8\n",
    "\n",
    "\n",
    "2. NEologd のインストール\n",
    "\n",
    "git clone https://github.com/neologd/mecab-ipadic-neologd.git\n",
    "\n",
    "cd mecab-ipadic-neologd\n",
    "\n",
    "sudo bin/install-mecab-ipadic-neologd\n",
    "\n",
    "(途中で止まり、\n",
    "Do you want to install mecab-ipadic-NEologd? Type yes or no.\n",
    "と聞かれたら、yesと入力)\n",
    "\n",
    "3. PythonからMeCabを使用できるようにする \n",
    "\n",
    "conda install -c anaconda swig\n",
    "\n",
    "pip install mecab-python3\n",
    "\n",
    "cd ..\n",
    "\n",
    "jupyter notebook --port 9999\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "機械\tキカイ\t機械\t名詞-一般\t\t\n",
      "学習\tガクシュウ\t学習\t名詞-サ変接続\t\t\n",
      "が\tガ\tが\t助詞-格助詞-一般\t\t\n",
      "好き\tスキ\t好き\t名詞-形容動詞語幹\t\t\n",
      "です\tデス\tです\t助動詞\t特殊・デス\t基本形\n",
      "。\t。\t。\t記号-句点\t\t\n",
      "EOS\n",
      "\n"
     ]
    }
   ],
   "source": [
    "import MeCab\n",
    "\n",
    "m_t = MeCab.Tagger('-Ochasen')\n",
    "\n",
    "text = '機械学習が好きです。'\n",
    "\n",
    "print(m_t.parse(text))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "機械学習\tキカイガクシュウ\t機械学習\t名詞-固有名詞-一般\t\t\n",
      "が\tガ\tが\t助詞-格助詞-一般\t\t\n",
      "好き\tスキ\t好き\t名詞-形容動詞語幹\t\t\n",
      "です\tデス\tです\t助動詞\t特殊・デス\t基本形\n",
      "。\t。\t。\t記号-句点\t\t\n",
      "EOS\n",
      "\n"
     ]
    }
   ],
   "source": [
    "import MeCab\n",
    "\n",
    "m_t = MeCab.Tagger('-Ochasen -d /usr/lib/mecab/dic/mecab-ipadic-neologd')\n",
    "\n",
    "text = '機械学習が好きです。'\n",
    "\n",
    "print(m_t.parse(text))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['機械学習', 'が', '好き', 'です', '。']\n"
     ]
    }
   ],
   "source": [
    "# 単語分割する関数を定義\n",
    "\n",
    "m_t = MeCab.Tagger('-Owakati -d /usr/lib/mecab/dic/mecab-ipadic-neologd')\n",
    "\n",
    "\n",
    "def tokenizer_mecab(text):\n",
    "    text = m_t.parse(text)  # これでスペースで単語が区切られる\n",
    "    ret = text.strip().split()  # スペース部分で区切ったリストに変換\n",
    "    return ret\n",
    "\n",
    "\n",
    "text = '機械学習が好きです。'\n",
    "print(tokenizer_mecab(text))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "以上"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
